Regression Models

Author

Jorge Bris Moreno

Assignment by: Dr. Purna Gamage

Instructions

Submission:

Optional:

Import

Code
#knitr::opts_chunk$set(include = FALSE) # for making prompts
knitr::opts_chunk$set(echo = TRUE) # for making solutions
knitr::opts_chunk$set(fig.width = 3)
library(knitr)
library(tidyverse)
library(modeldata)
library(leaps)
library(caret)
library(corrplot)
library(MASS)
library(ISLR)
library(glmnet)
library(gam)
library(ggplot2)
library(dplyr)

HW-3.1: Stock returns

Use polynomials and ridge regression to predict stock returns

This problem uses the built-in EuStockMarkets dataset. The dataset contains time series of closing prices of major European stock indices from 1991 to 1998. We use only the FTSE column in this problem. The dataset is a time series object, so you will need to extract the FTSE column and make it into a data frame.

HW-3.1a:

Fit polynomial models of degrees 4, 8, 12 to the FTSE data and plot all three fitted curves together with a scatterplot of the data. Comment on the plots. Which features in the data are resolved by the polynomial models? Which features are not resolved? Do the polynomial curves show any artifacts such as oscillations?

Code
# GET DATA
data("EuStockMarkets")
print(head(EuStockMarkets))
         DAX    SMI    CAC   FTSE
[1,] 1628.75 1678.1 1772.8 2443.6
[2,] 1613.63 1688.5 1750.5 2460.2
[3,] 1606.51 1678.6 1718.0 2448.2
[4,] 1621.04 1684.1 1708.1 2470.4
[5,] 1618.16 1686.6 1723.1 2484.7
[6,] 1610.61 1671.6 1714.3 2466.8
Code
ftse <- EuStockMarkets[,4]
Code
print(attributes(ftse))
$tsp
[1] 1991.496 1998.646  260.000

$class
[1] "ts"
Code
typeof(ftse)
[1] "double"
Code
# INSERT CODE HERE 

# Load my theme
my_theme <- readRDS('~/.Rthemes/my_theme.rds')

# make ftse a df
ftse_df <- data.frame(date = seq(as.Date("1991-01-01"), by = "day", length.out = length(ftse)), FTSE = as.vector(ftse))

# Models
fit4 <- lm(FTSE ~ poly(date, 4, raw=FALSE), data = ftse_df)
fit8 <- lm(FTSE ~ poly(date, 8, raw=FALSE), data = ftse_df)
fit12 <- lm(FTSE ~ poly(date, 12, raw=FALSE), data = ftse_df)

# Predictions
pred4 <- predict(fit4, newdata = ftse_df)
pred8 <- predict(fit8, newdata = ftse_df)
pred12 <- predict(fit12, newdata = ftse_df)
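Note that the constructed `date` column above is an approximation: the series actually runs on a fractional-year index (1991.496 to 1998.646 at 260 observations per year, as `tsp` showed). A small sketch of an alternative conversion that keeps the `ts` object's own time index via `time()` (the `ftse_df2`/`year` names are ours, for illustration):

```r
# Convert the FTSE ts to a data frame using its stored time index
data("EuStockMarkets")
ftse <- EuStockMarkets[, "FTSE"]
ftse_df2 <- data.frame(year = as.numeric(time(ftse)),  # fractional years, e.g. 1991.496
                       FTSE = as.numeric(ftse))
head(ftse_df2)
```

Polynomials could then be fit against `year` instead of a constructed Date column.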
Code
# Plot
ggplot(ftse_df, aes(x = date)) +
  geom_point(aes(y = FTSE)) +
  geom_line(aes(y = pred4, color = "Degree 4"), linewidth = 1.5) +
  geom_line(aes(y = pred8, color = "Degree 8"), linewidth = 1.5) +
  geom_line(aes(y = pred12, color = "Degree 12"), linewidth = 1.5) +
  scale_color_manual(
    name = "Degree of Poly",
    values = c("Degree 4" = "pink", "Degree 8" = "royalblue1", "Degree 12" = "darkolivegreen1")) +
  labs(title = "FTSE over time & Fitted Polynomial Curves",
       x = "Date", y = "FTSE") +
  my_theme() +
  theme(panel.grid = element_line(color = "grey"))

The degree-12 polynomial tracks the data best, followed by degree 8 and then degree 4: the higher the degree, the more closely the curve can follow the long-run pattern. All three resolve the overall trend, but none of them captures the short-term ups and downs (likely seasonal or monthly variation). The curves do show some oscillations (more at higher degree), yet they still cannot match the complexity that time adds to the data. Moreover, some oscillations appear that are not in the data at all, especially for degrees 8 and 12, which drop sharply at the beginning and at the end of the sample; these boundary artifacts are a sign of overfitting.
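To put numbers on the visual comparison, the in-sample RMSE of each polynomial fit can be computed directly; a quick sketch (using the fractional-year index, and `rmse_by_degree` as an illustrative name):

```r
# In-sample RMSE for polynomial fits of degree 4, 8, 12 on the FTSE series
data("EuStockMarkets")
ftse_num <- data.frame(
  date = as.numeric(time(EuStockMarkets)),        # fractional years
  FTSE = as.numeric(EuStockMarkets[, "FTSE"])
)
rmse_by_degree <- sapply(c(4, 8, 12), function(d) {
  fit <- lm(FTSE ~ poly(date, d), data = ftse_num)
  sqrt(mean(residuals(fit)^2))
})
names(rmse_by_degree) <- paste0("deg", c(4, 8, 12))
print(round(rmse_by_degree, 1))
```

Because the polynomial bases are nested, the training RMSE is guaranteed to be nonincreasing in degree, so these numbers confirm (but cannot by themselves justify) the "higher degree fits better in-sample" reading.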

HW-3.1.b:

Use ridge regression to regularize the polynomial model of degree 12. Use \(\lambda_{1SE}\). Plot the resulting polynomial model onto the data and comment on it.

Code
# Make model matrix (degree 12)
features <- model.matrix(~ poly(as.numeric(ftse_df$date), 12, raw = FALSE), data = ftse_df)[, -1] 

set.seed(444)

# Cross Validation
cv <- cv.glmnet(features, ftse_df$FTSE, alpha = 0)
# lambda 1se (optimal lambda with 1 standard error rule)
lambda_1SE <- cv$lambda.1se 

# model & predictions
ridge_model <- glmnet(features, ftse_df$FTSE, alpha = 0)
ridge_predictions <- predict(ridge_model, s = lambda_1SE, newx = features)

# Plot
plot(ftse_df$date, ftse_df$FTSE, type = "l", xlab = "Date", ylab = "FTSE", main = "Ridge Regression on Degree 12 Polynomial")
lines(ftse_df$date, ridge_predictions, col = "red")
legend("bottomright", legend = c("Actual", "Ridge Prediction"), col = c("black", "red"), lwd = 2, bty = "n")

Ridge regression has reduced some of the overfitting, but the fitted curve still drops dramatically at the endpoints, which would be a problem for predicting future data. Although ridge penalizes the magnitude of the coefficients, it does not fully remove these boundary artifacts.
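The shrinkage at work here can be seen in closed form: ridge solves \((X^\top X + \lambda I)\beta = X^\top y\), and larger \(\lambda\) pulls the coefficient norm toward zero. A minimal base-R sketch on toy polynomial features (not the FTSE data, and independent of glmnet; `ridge_coef` is our own helper):

```r
# Closed-form ridge coefficients on a standardized degree-12 polynomial basis
set.seed(1)
t <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * t) + rnorm(200, sd = 0.2)
X <- scale(poly(t, 12))                       # centered, unit-variance columns
ridge_coef <- function(X, y, lambda) {
  # solves (X'X + lambda I) beta = X'(y - mean(y))
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y - mean(y)))
}
norms <- sapply(c(0, 1, 10, 100),
                function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
print(round(norms, 3))                        # coefficient norm shrinks as lambda grows
```

Note that glmnet scales its objective by \(n\) and standardizes internally, so its \(\lambda\) values are on a different scale from the ones here.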

HW-3.2: Advertising budgets

Improve advertising budgets using GAMs

Use the Advertising dataset, which can be found here. Split the data into a training set and a test set (70% / 30%).

Code
# GET DATA
set.seed(441)
ads <- read_csv('https://www.statlearning.com/s/Advertising.csv')
ads <- ads[,-1]

# Splitting data
trainIndex <- sample(seq_len(nrow(ads)), size = floor(0.7 * nrow(ads)))

train_data <- ads[trainIndex, ]
test_data <- ads[-trainIndex, ]

# For our checking
paste("rows of training data:", nrow(train_data), "rows of testing data:", nrow(test_data))
[1] "rows of training data: 140 rows of testing data: 60"

HW-3.2a:

Fit generalized additive models to predict sales, using smoothing splines of degrees 2, 3, 4, 5, 6 for the three predictors. How do the rms prediction errors compare to the rms prediction error of a multiple regression model on the training set? On the test set?

Code
# INSERT CODE HERE 

# Linear Model
lm_model <- lm(sales ~ TV + radio + newspaper, data = train_data)

# Predictions on test and training
lm_predictions_train <- predict(lm_model, newdata = train_data)
lm_predictions_test <- predict(lm_model, newdata = test_data)

# RMS
lm_rms_train <- sqrt(mean((lm_predictions_train - train_data$sales)^2))
lm_rms_test <- sqrt(mean((lm_predictions_test - test_data$sales)^2))

print(paste("Linear Model RMS on Training Set:", lm_rms_train))
[1] "Linear Model RMS on Training Set: 1.58165007636303"
Code
print(paste("Linear Model RMS on Test Set:", lm_rms_test))
[1] "Linear Model RMS on Test Set: 1.93558979182608"
Code
# INSERT CODE HERE 
# Set up the degrees, predictors, and storing for models and names
degrees <- 2:6
predictors <- c("TV", "radio", "newspaper")
gam_mods <- list()
mod_names <- vector()

# Loop through all degrees
for (degree in degrees) {
  # Formula: natural-spline (ns) terms with df = degree for each predictor
  gam_form <- as.formula(paste("sales ~", paste("ns(", predictors, ", df =", degree, ")", collapse = " + ")))
  
  # Fit model and store
  gam_mod_name <- paste("GAM_Degree", degree, sep = "_")
  gam_mods[[gam_mod_name]] <- gam(gam_form, data = train_data)
  mod_names <- c(mod_names, gam_mod_name)}

# RMSE function
calculateRMSE <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))}

# to store RMSE
rmse_train <- setNames(numeric(length(mod_names)), mod_names)
rmse_test <- setNames(numeric(length(mod_names)), mod_names)

# RMSE for each GAM model (training)
for (mod_name in names(gam_mods)) {
  predictions1 <- predict(gam_mods[[mod_name]], newdata = train_data, type = "response")
  rmse_train[mod_name] <- calculateRMSE(train_data$sales, predictions1)}

# RMSE for each GAM model (testing)
for (mod_name in names(gam_mods)) {
  predictions2 <- predict(gam_mods[[mod_name]], newdata = test_data, type = "response")
  rmse_test[mod_name] <- calculateRMSE(test_data$sales, predictions2)}

# RMSE values
print("RMSE for each model on the train set:")
[1] "RMSE for each model on the train set:"
Code
print(rmse_train)
GAM_Degree_2 GAM_Degree_3 GAM_Degree_4 GAM_Degree_5 GAM_Degree_6 
    1.359051     1.319147     1.265214     1.249750     1.201283 
Code
print("RMSE for each model on the test set:")
[1] "RMSE for each model on the test set:"
Code
print(rmse_test)
GAM_Degree_2 GAM_Degree_3 GAM_Degree_4 GAM_Degree_5 GAM_Degree_6 
    1.966195     1.809398     1.836296     1.847180     1.930020 

On the training set, all the GAM errors are lower than the multiple regression error, and they decrease further as the degrees of freedom increase, which is expected. On the test set, however, the degree-2 model performs about the same as (or a tiny bit worse than) multiple regression, while the others do better. The degree-3 model does best; beyond that, the test error slowly increases with degree.
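The pattern in these tables (training error falls with flexibility while test error bottoms out and then climbs) can be reproduced in miniature with base R's `smooth.spline` on synthetic data; a toy sketch, not the Advertising data:

```r
# Train vs. test RMSE as spline flexibility (df) increases, on synthetic data
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
train <- 1:140; test <- 141:200
rmse <- function(a, b) sqrt(mean((a - b)^2))
res <- sapply(c(3, 10, 25), function(df) {
  fit <- smooth.spline(x[train], y[train], df = df)
  c(train = rmse(predict(fit, x[train])$y, y[train]),
    test  = rmse(predict(fit, x[test])$y,  y[test]))
})
colnames(res) <- paste0("df", c(3, 10, 25))
print(round(res, 3))
```

The exact numbers depend on the seed; the qualitative train/test divergence is the point.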

HW-3.2.b:

Is there evidence of overfitting?

INSERT EXPLANATION HERE

There is no clear sign of overfitting, since the GAM models have lower test errors than the multiple regression model. Still, degree 3 does best, which suggests the higher-degree fits overfit slightly relative to it, though the difference is small.

HW-3.2.c:

You now have six models (five GAM and one LM). Which model should be used? Explain your answer.

INSERT EXPLANATION HERE

The degree-3 GAM has the lowest test RMSE, and it is also the simplest of the flexible models after degree 2. Both points suggest it will generalize best to new data, and it shows no signs of overfitting, so it is the model I would choose.

HW-3.3: Boston housing

Use LASSO to predict housing prices in Boston

Consider the Boston data from the MASS package. We want to use LASSO to predict the median home value medv using all the other predictors.

HW-3.3.a:

Set up the LASSO and plot the trajectories of all coefficients. What are the last five variables to remain in the model?

Note: This was lab 3.1 question 2.

Code
# INSERT CODE HERE 

data("Boston")

# data as matrix
x <- as.matrix(Boston[, -which(names(Boston) == "medv")])
y <- Boston$medv

set.seed(444)

# split data and set as matrix
train_indices <- sample(1:nrow(Boston), size = 0.8 * nrow(Boston))

train_data <- Boston[train_indices, ]
test_data <- Boston[-train_indices, ]

x_train <- as.matrix(train_data[, -which(names(train_data) == "medv")])
y_train <- train_data$medv

x_test <- as.matrix(test_data[, -which(names(test_data) == "medv")])
y_test <- test_data$medv
Code
# INSERT CODE HERE 
# model
lasso_model <- glmnet(x_train, y_train, alpha = 1)

# plot
plot(lasso_model, xvar = "lambda", label = TRUE)
title("Trajectories of coefficients", line = 3)

Code
# get the lambdas
lambda_seq <- exp(seq(log(max(lasso_model$lambda)), log(min(lasso_model$lambda)), length.out = 100))

# to save the variables
last_five_vars <- character()

# for all lambdas
for (lambda in lambda_seq) {
    # get the coefficients for this lambda
    lasso_coef <- predict(lasso_model, type = "coefficients", s = lambda)
    # make it as numeric
    coef_vector <- as.numeric(lasso_coef[-1, , drop = FALSE])
    # indices of nonzero coefficients
    non_zero_coefs <- which(coef_vector != 0)
    if (length(non_zero_coefs) == 5) {
      # get last five without the intercept
        last_five_vars <- rownames(lasso_coef)[-1][non_zero_coefs]
        break}}

cat("Last five variables to remain:", paste(last_five_vars, collapse = ", "))
Last five variables to remain: chas, rm, ptratio, black, lstat

HW-3.3.b:

Find the 1SE value of \(\lambda\), using 10-fold cross-validation. What is the cross validation estimate for the residual standard error?

Code
# INSERT CODE HERE 

set.seed(444)
# model
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)

# best lambda
lambda_min <- cv_lasso$lambda.min

# 1SE
lambda_1se <- cv_lasso$lambda.1se

# get the index
index_1se <- which.min(abs(cv_lasso$lambda - lambda_1se))

# select for that lambda
cvm_1se <- sqrt(cv_lasso$cvm[index_1se])

cat("Lambda 1-SE:", lambda_1se, "\n")
Lambda 1-SE: 0.3625409 
Code
cat("Cross-validation estimate for Residual Standard Error:", cvm_1se)
Cross-validation estimate for Residual Standard Error: 4.924468
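Since cv.glmnet was run with the default Gaussian loss, cvm holds the cross-validated mean squared error, so the residual standard error estimate is just its square root at the chosen \(\lambda\):

\[
\widehat{\mathrm{RSE}} = \sqrt{\mathrm{cvm}(\lambda_{1SE})} \approx \sqrt{24.25} \approx 4.92
\]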

HW-3.3.c:

Rescale all predictors so that their mean is zero and their standard deviation is 1. Then set up the LASSO and plot the trajectories of all coefficients. What are the last five variables to remain in the model? Compare your answer to part a.

Code
# INSERT CODE HERE 

# scale
x_train_scaled <- scale(x_train) 

set.seed(444) 
# model
lasso_model <- glmnet(x_train_scaled, y_train, alpha = 1)

# plot
plot(lasso_model, xvar = "lambda", label = TRUE)
title("Trajectories of coefficients", line = 3)

Code
# INSERT CODE HERE 

lambda_seq <- exp(seq(log(max(lasso_model$lambda)), log(min(lasso_model$lambda)), length.out = 100))

last_five_vars <- character()

for (lambda in lambda_seq) {
 
    lasso_coef <- predict(lasso_model, type = "coefficients", s = lambda)
    
    coef_vector <- as.numeric(lasso_coef[-1, , drop = FALSE])
    
    non_zero_coefs <- which(coef_vector != 0)
    
    if (length(non_zero_coefs) == 5) {
        last_five_vars <- rownames(lasso_coef)[-1][non_zero_coefs]
        break}}

cat("Last five variables to remain:", paste(last_five_vars, collapse = ", "))
Last five variables to remain: chas, rm, ptratio, black, lstat

The last 5 variables are still the same. However, rescaling changes the coefficient trajectories, as seen in the plot, and makes them easier to compare visually.

HW-3.3.d:

Find the 1SE value of \(\lambda\) using 10-fold cross-validation. What is the cross validation estimate for the residual standard error now? Does rescaling lead to a better performing model?

Code
# INSERT CODE HERE 
set.seed(444)
lasso.out <- cv.glmnet(x_train_scaled, y_train, alpha = 1, nfolds = 10)

# extract the lambda value that is one standard error away from the minimum lambda
lambda_1se <- lasso.out$lambda.1se

# calculate the cross-validated RSE
rse <- sqrt(lasso.out$cvm[lasso.out$lambda == lambda_1se])

# print lambda value and RSE
cat("1SE Lambda:", lambda_1se, "\nRSE:", rse)
1SE Lambda: 0.3625409 
RSE: 4.924468

The 1SE lambda and the resulting error are identical to part b. This is expected: glmnet standardizes the predictors internally by default (standardize = TRUE) and reports coefficients on the original scale, so rescaling the design matrix beforehand does not change the cross-validated fit.

HW-3.4: Bike share usage

Predict bike share usage in Seoul using ridge and LASSO regressions

Access the dataset here. Filter the data to only include rows with "Functioning Day" == 'Yes'. Next drop the columns Date, Hour, Seasons, Holiday, and Functioning Day. Then drop any rows that have missing values in any column. Hint: You will need to rename some of the variable names because they include non-ASCII characters. This will help you later on.

Code
# GET DATA

# Obtaining data and column names
bike <- read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv', 
                 locale = locale(encoding = "UTF-8"), show_col_types = FALSE)

colnames(bike)
 [1] "Date"                         "Rented Bike Count"           
 [3] "Hour"                         "Temperature(\xb0C)"          
 [5] "Humidity(%)"                  "Wind speed (m/s)"            
 [7] "Visibility (10m)"             "Dew point temperature(\xb0C)"
 [9] "Solar Radiation (MJ/m2)"      "Rainfall(mm)"                
[11] "Snowfall (cm)"                "Seasons"                     
[13] "Holiday"                      "Functioning Day"             
Code
# Keep only rows where Functioning Day == "Yes"
bike <- bike %>% filter(`Functioning Day` == "Yes")

# Manually rename columns (original names include non-ASCII characters)
names(bike) <- c("Date", "Rented_Bike_Count", "Hour", "Temperature_C", 
                          "Humidity", "Wind_speed_ms", "Visibility_10m", 
                          "Dew_point_temperature_C", "Solar_Radiation_MJ_m2", 
                          "Rainfall_mm", "Snowfall_cm", "Seasons", 
                          "Holiday", "Functioning_Day")

# Drop unwanted columns
cols_to_drop <- c("Date", "Hour", "Seasons", "Holiday", "Functioning_Day")

bike <- bike %>% dplyr::select(-one_of(cols_to_drop))

# Drop NAs
bike <- na.omit(bike)

head(bike)
# A tibble: 6 × 9
  Rented_Bike_Count Temperature_C Humidity Wind_speed_ms Visibility_10m
              <dbl>         <dbl>    <dbl>         <dbl>          <dbl>
1               254          -5.2       37           2.2           2000
2               204          -5.5       38           0.8           2000
3               173          -6         39           1             2000
4               107          -6.2       40           0.9           2000
5                78          -6         36           2.3           2000
6               100          -6.4       37           1.5           2000
# ℹ 4 more variables: Dew_point_temperature_C <dbl>,
#   Solar_Radiation_MJ_m2 <dbl>, Rainfall_mm <dbl>, Snowfall_cm <dbl>

HW-3.4.a:

Run a linear regression to predict rented bike count using the remaining 8 variables in the dataset. Report the MSE and the most influential variables.

Code
# linear regression model
lm_model <- lm(Rented_Bike_Count ~ ., data = bike)

# extract MSE
mse <- mean(lm_model$residuals^2)

# get variable coefficients
coefs <- coef(lm_model)[-1]  # Exclude intercept

# identify influential variables (5)
most_influential <- names(sort(abs(coefs), decreasing = TRUE)[1:5])

# print MSE and vars 
print(paste("MSE:", mse))
[1] "MSE: 234111.046434225"
Code
cat("The 5 most influential variables:", most_influential, "\n")
The 5 most influential variables: Solar_Radiation_MJ_m2 Wind_speed_ms Rainfall_mm Snowfall_cm Temperature_C 
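A caveat on ranking by raw coefficient size: these predictors are on very different scales (visibility in tens of meters, rainfall in mm), so raw magnitudes conflate units with influence. A hedged sketch of the standardize-first alternative, using the built-in mtcars data as a stand-in rather than the bike data:

```r
# Raw vs. standardized coefficient magnitudes can rank predictors differently
fit_raw <- lm(mpg ~ disp + drat, data = mtcars)
scaled  <- data.frame(mpg = mtcars$mpg,
                      scale(mtcars[, c("disp", "drat")]))  # mean 0, sd 1
fit_std <- lm(mpg ~ disp + drat, data = scaled)
print(sort(abs(coef(fit_raw)[-1]), decreasing = TRUE))  # drat looks larger
print(sort(abs(coef(fit_std)[-1]), decreasing = TRUE))  # disp actually dominates
```

Applied to the bike data, the same recipe would rank influence on a common per-standard-deviation scale.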

HW-3.4.b:

Fit a ridge regression model with the optimal \(\lambda\) chosen by cross validation. Report the CV MSE.

Code
# Data as matrix
x <- as.matrix(subset(bike, select = -c(Rented_Bike_Count)))
y <- bike$Rented_Bike_Count

# fit ridge with CV
ridge_mod <- cv.glmnet(x, y, alpha = 0)

# Optimal lambda
opt_lambda <- ridge_mod$lambda.min

# fit ridge model with optimal lambda
ridge_fit <- glmnet(x, y, alpha = 0, lambda = opt_lambda)

ridge_pred <- predict(ridge_fit, s = opt_lambda, newx = x)

# MSE
cv_mse <- min(ridge_mod$cvm)

# Print
cat("CV MSE:", cv_mse)
CV MSE: 236017

HW-3.4.c:

Perform the same fit using a LASSO regression this time. Choose the optimal \(\lambda\) using cross validation. Report on the remaining variables in the model and the CV MSE. How does this performance compare to ridge and a plain linear model?

Code
# INSERT CODE HERE 

# Fit Lasso
cv_lasso <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

# Optimal lambda
optimal_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda for LASSO:", optimal_lambda_lasso, "\n")
Optimal lambda for LASSO: 0.9379857 
Code
# MSE
cv_mse_lasso <- min(cv_lasso$cvm)
paste("CV MSE for LASSO with optimal lambda:", cv_mse_lasso)
[1] "CV MSE for LASSO with optimal lambda: 234598.796445585"
Code
# Model with optimal lambda
lasso_model <- glmnet(x, y, alpha = 1, lambda = optimal_lambda_lasso)

# Coefficients
coef_lasso <- coef(lasso_model)
# Remaining vars
remaining_vars <- rownames(coef_lasso)[coef_lasso[,1] != 0]

# Print remaining Vars
cat("Remaining variables in the LASSO model:\n", remaining_vars)
Remaining variables in the LASSO model:
 (Intercept) Temperature_C Humidity Wind_speed_ms Visibility_10m Solar_Radiation_MJ_m2 Rainfall_mm Snowfall_cm
Code
print("Coefficients:")
[1] "Coefficients:"
Code
print(lasso_coefs<-coef(cv_lasso, s = optimal_lambda_lasso, exact = TRUE))
9 x 1 sparse Matrix of class "dgCMatrix"
                                   s1
(Intercept)              8.523407e+02
Temperature_C            3.634304e+01
Humidity                -1.052416e+01
Wind_speed_ms            5.231777e+01
Visibility_10m           3.064328e-03
Dew_point_temperature_C  .           
Solar_Radiation_MJ_m2   -1.141101e+02
Rainfall_mm             -5.273742e+01
Snowfall_cm              3.350817e+01

This model has a lower CV MSE than ridge, but slightly higher than the plain linear model's (training) MSE.

HW-3.4.e:

Interpretation and communication. Write a short paragraph about your analysis and recommendations, explaining the most important factors for high bike share usage, why you came to that conclusion, and what actions can be taken by a bike rental company based on this information.

INSERT EXPLANATION HERE

Across all the models run, the MSE is lowest for the plain linear model, followed by LASSO and then ridge, which suggests using the plain linear model, although all three values are very similar. For the LASSO model, the most influential variables are Temperature, Humidity, Wind Speed, Visibility, Solar Radiation, Rainfall, and Snowfall, giving insight into the most important predictors of rented bike count. Humidity, Solar Radiation, and Rainfall affect rentals negatively, while the others affect them positively. A bike rental company can use this information when deciding where to open new rental stations, how many staff to hire in different seasons, how to prepare for the weather forecast, and where to place marketing campaigns strategically.

HW-3.5: Splines

Compare the characteristics of two different smoothing splines

Consider two curves, \(\hat{g}_1\) and \(\hat{g}_2\), defined as follows:

\[ \hat{g}_1 = argmin_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \right) \]

\[ \hat{g}_2 = argmin_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \right) \] where \(g^{(m)}\) represents the \(m^{th}\) derivative of \(g\).

HW-3.5.a:

As \(\lambda \to \infty\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller training RSS?

INSERT EXPLANATION HERE

As lambda goes to infinity, the penalty dominates, forcing the penalized derivative to zero: \(g^{(3)} = 0\) for \(\hat{g}_1\) and \(g^{(4)} = 0\) for \(\hat{g}_2\). This makes a difference because \(\hat{g}_1\) is then restricted to polynomials of degree 2 (quadratics), while \(\hat{g}_2\) may be any polynomial of degree 3 (cubics). Since the cubics strictly contain the quadratics, \(\hat{g}_2\) is more flexible: its training RSS can be no larger than that of \(\hat{g}_1\), and will typically be smaller. How much smaller depends on the shape of the data; both limiting fits are extremely smooth, so the values may be close.
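The limiting families can be written out explicitly: forcing each penalty to zero characterizes the polynomials that survive.

\[
\int [g^{(3)}(x)]^2\,dx = 0 \iff g(x) = a_0 + a_1 x + a_2 x^2, \qquad
\int [g^{(4)}(x)]^2\,dx = 0 \iff g(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3
\]

So as \(\lambda \to \infty\), \(\hat{g}_1\) becomes the least-squares quadratic and \(\hat{g}_2\) the least-squares cubic.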

HW-3.5.b:

As \(\lambda \to \infty\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller test RSS?

INSERT EXPLANATION HERE

From part a we know the limiting forms of \(\hat{g}_1\) (quadratic) and \(\hat{g}_2\) (cubic). As lambda goes to infinity, both become extremely smooth, which likely causes underfitting. Which one has the smaller test RSS depends on the true shape of the data: the simpler quadratic may extrapolate better, or the cubic's extra flexibility may help. Either way, the expected difference is small because of the extreme smoothness of both fits.

HW-3.5.c:

For \(\lambda = 0\), which function (\(\hat{g_1}\) or \(\hat{g_2}\)) will have the smaller training and test RSS?

INSERT EXPLANATION HERE

When lambda equals 0, the entire penalty term vanishes, allowing the splines to fit the data as closely as possible. The loss term on the left is identical for \(\hat{g}_1\) and \(\hat{g}_2\), so both solve the same minimization, produce the same fit, and therefore have the same training and test RSS.

HW-3.6:

Explain the behavior of the curve for a variety of \(\lambda\) and \(m\) values.

Suppose a curve \(\hat{g}\) is fit smoothly to a set of \(n\) points as follows:

\[ \hat{g} = argmin_g \left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(m)}(x)]^2 \right) \]

where \(g^{(m)}\) is the \(m\)th derivative of \(\hat{g}\) and \(g^{(0)}=g\). Provide plots of \(\hat{g}\) in each of the following scenarios along with the original points provided.
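As a reference point before plotting, each scenario's limiting fit can be read off from which functions make the penalty vanish:

\[
\lambda = \infty:\quad
m=0 \Rightarrow \hat{g} = 0, \quad
m=1 \Rightarrow \hat{g} = \bar{y}, \quad
m=2 \Rightarrow \hat{g} = \hat\beta_0 + \hat\beta_1 x, \quad
m=3 \Rightarrow \text{least-squares quadratic}
\]

\[
\lambda = 0,\; m=3 \Rightarrow \hat{g} \text{ interpolates the points (zero training RSS)}
\]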

Use the following starter code to make your set of points and plot your various model predictions.

Code
set.seed(325626)

X <- runif(100)
eps <- rnorm(100)
Y <- sin(12*(X + 0.2)) / (X + 0.2) + eps
generating_fn <- function(X) {sin(12*(X + 0.2)) / (X + 0.2)}
df <- data.frame(X, Y)

ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  scale_color_manual(values = "deepskyblue3") + 
  theme(legend.position = "right", legend.title = element_blank())

HW-3.6.a:

\(\lambda = \infty, m = 0\).

Code
# INSERT SOLUTION HERE 
# Large lambda to approximate infinity (largest value that would run)
lambda <- 1e8

# m
m <- 0

# fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# predictions as df
preds_df <- as.data.frame(preds)

# plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank()) 

HW-3.6.b:

\(\lambda = \infty, m = 1\).

Code
# INSERT SOLUTION HERE 

# Large lambda to approximate infinity (largest value that would run)
lambda <- 1e8

# m
m <- 1

# fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) +  # plot original points
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())

HW-3.6.c:

\(\lambda = \infty, m = 2\).

Code
# INSERT SOLUTION HERE 

# Large lambda to approximate infinity (largest value that would run)
lambda <- 1e8

# m
m <- 2

# fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) +
  theme(legend.position = "right", legend.title = element_blank())

HW-3.6.d:

\(\lambda = \infty, m = 3\).

Code
# INSERT SOLUTION HERE 

# Large lambda to approximate infinity (largest value that would run)
lambda <- 1e8

# m
m <- 3

# fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, df = 100, lambda = lambda)

# Predict using fitted smoothing spline with derivative order m
preds <- predict(fit, df$X, deriv = m)

# predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) +
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) + 
  theme(legend.position = "right", legend.title = element_blank())

HW-3.6.e:

\(\lambda = 0, m = 3\).

Code
# INSERT SOLUTION HERE 

# lambda = 0
lambda <- 0

# m
m <- 3

# fit smoothing spline to data
fit <- smooth.spline(df$X, df$Y, lambda = lambda)

# Predict using fitted smoothing spline
preds <- predict(fit, df$X)

# predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) +  
  stat_function(fun = generating_fn, aes(col = "Generating Function")) +  
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) + 
  theme(legend.position = "right", legend.title = element_blank()) 

HW-3.6.f:

Fit a smoothing spline on the dataset and report the optimal lambda

Code
# INSERT SOLUTION HERE 

# Optimal fit spline
optimal_fit <- smooth.spline(df$X, df$Y)

# optimal lambda defined by our fit
print(paste("Optimal lambda:", optimal_fit$lambda))
[1] "Optimal lambda: 0.000111660415400747"
Code
# Predict using fitted smoothing spline
preds <- predict(optimal_fit, df$X)

# predictions as df
preds_df <- as.data.frame(preds)

# Plot
ggplot(df, aes(x = X, y = Y)) + 
  geom_point(alpha = 0.5) + 
  stat_function(fun = generating_fn, aes(col = "Generating Function")) + 
  geom_line(data = preds_df, aes(x = x, y = y, col = "Fitted Spline")) + 
  theme(legend.position = "right", legend.title = element_blank())